METHOD AND SYSTEM FOR URL-BASED
SCREENING OF ELECTRONIC COMMUNICATIONS 

REFERENCE TO RELATED APPLICATION 

[0001] This Application claims the benefit of United States Provisional

Patent Application No. 60/537,921, filed on January 20, 2004, and entitled, 
"Whiplash A New Signature Scheme for SpamNet." 
FIELD OF INVENTION 

[0002] The present invention relates to electronic communication, and

more particularly, to screening electronic communication. 
BACKGROUND 

[0003] As the use of electronic communications has become 

increasingly popular for both personal purposes and work-related purposes, more
marketers send spams to advertise their products and/or services. As used
herein, the term "spam" refers to electronic communication that is not requested 
and/or is non-consensual. Also known as "unsolicited commercial e-mail" 
(UCE), "unsolicited bulk e-mail" (UBE), "gray mail" and just plain "junk mail," 
spam is typically used to advertise products. The term "electronic 
communication" as used herein is to be interpreted broadly to include any type of 
electronic communication or message including voice mail communications,
short message service (SMS) communications, multimedia messaging service 
(MMS) communications, facsimile communications, etc. 
[0004] However, the mass distribution of spams causes many users not

only nuisance, but costly problems as well. Spams clutter the inboxes of users,
who have to manually go through the incoming electronic communications to
separate the unsolicited communications from other legitimate communications.
Furthermore, spams generate massive amounts of useless traffic in the electronic
communication networked systems of many companies, which at best may slow
down the delivery of important communications and at worst may crash the
networked systems of the companies.
[0005] A current way to screen electronic communications is to analyze

the content of incoming electronic communications. Existing software analyzes
the message body of incoming electronic communications to generate a number 
of fingerprints or signatures. The message body of a spam typically contains a 
marketing message of the spam sender, who is also known as a spammer. 
However, the spammer may randomly make minor modifications in the body of
the spam such that the fingerprints generated may not recognize the modified
spam. Therefore, another existing way to screen electronic communications for
spams applies a similarity algorithm to catch electronic communications having
content substantially similar to the content of a previously identified spam. 
[0006] However, such content-based screening processes are not typically 

satisfactory because a spammer may randomize the contents of the spams to 
defeat these screening processes. For example, some spams are littered with 
random junk to avoid detection by the existing content-based screening processes.
SUMMARY 

[0007] The present invention includes a method and an apparatus to 

screen electronic communications. In one embodiment, the method includes
extracting Uniform Resource Locators (URLs) from electronic communication
and analyzing the URLs extracted to determine whether the electronic 
communication is of a first predetermined category. 

[0008] In a specific embodiment, the URL includes either a domain name 

(which is a part of or equivalent to a hostname), or an Internet Protocol (IP)
address. 

[0009] Other features of the present invention will be apparent from the 

accompanying drawings and from the detailed description that follows.
BRIEF DESCRIPTION OF THE DRAWINGS 

[0010] The present invention is illustrated by way of example and not 

limitation in the figures of the accompanying drawings, in which like references
indicate similar elements and in which: 
[0011] Figure 1 illustrates a flow diagram of one embodiment of a process

for screening electronic communications; and 

[0012] Figure 2 illustrates one embodiment of a networked system. 

DETAILED DESCRIPTION 

[0013] A method and an apparatus to screen electronic communications 

are described. In one embodiment, the method includes extracting URLs from 
electronic communication and analyzing the URLs extracted to determine 
whether the electronic communication is of a first predetermined category. 
[0014] In the following description, numerous specific details are set 

forth. However, it is understood that embodiments of the invention may be
practiced without these specific details. In other instances, well-known 
components, structures, and techniques have not been shown in detail in order not 
to obscure the understanding of this description. 

[0015] Reference in the specification to "one embodiment" or "an 

embodiment" means that a particular feature, structure, or characteristic described 
in connection with the embodiment is included in at least one embodiment of the 
invention. The appearances of the phrase "in one embodiment" in various places 
in the specification do not necessarily all refer to the same embodiment. 
[0016] Figure 1 shows a flow diagram of one embodiment of a process 

for screening electronic communications to identify electronic communications of
a first predetermined category. The process is performed by processing logic that
may comprise hardware (e.g., circuitry, dedicated logic, etc.), software (such as is 
run on a general-purpose computer system or a dedicated machine), or a 
combination of both. 

[0017] Referring to Figure 1, when the electronic communication 101 

(e.g., an email) is received, processing logic extracts URLs from the electronic 
communication 101 (processing block 110). 

[0018] In some embodiments, the first predetermined category of 

electronic communications includes spams. The spammers may use the web 
pages referenced by the URLs as landing web pages in the spams. The URL may
include either a domain name (which is a part of or equivalent to a hostname)
registered by spammers, a protocol, a subsection of a domain relative link, or an
Internet Protocol (IP) address. For instance, processing logic may search for
strings that begin with "http://", "https://", "ftp://", or "gopher://", and end with "@",

"<", or "/" characters. Additionally, processing logic may also extract
strings containing "www." because strings containing "www." may be
automatically converted into clickable URLs by some electronic communication
software.
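
By way of illustration only, the extraction described above might be sketched in Python as follows. The function name extract_hosts, the regular expressions, and the treatment of white space, quotation marks, and ">" as additional terminators are assumptions made for this sketch and are not part of the described embodiments.

```python
import re

# Characters that end the captured string. "@", "<", and "/" come from the
# description; white space, quotes, and ">" are assumptions of this sketch.
_END = r"""@</\s"'>"""

# Strings that begin with one of the listed protocol prefixes.
_SCHEME_HOST = re.compile(r"(?:https?|ftp|gopher)://([^" + _END + r"]+)", re.IGNORECASE)

# Bare "www." strings that are not already part of a prefixed URL.
_BARE_WWW = re.compile(r"(?<![/\w.])(www\.[^" + _END + r"]+)", re.IGNORECASE)


def extract_hosts(text):
    """Return candidate host strings in order of discovery, without duplicates."""
    found = _SCHEME_HOST.findall(text) + _BARE_WWW.findall(text)
    return list(dict.fromkeys(found))
```

For example, a message body containing "http://m6nb2.pillzthatwork.com/offer" would yield the string "m6nb2.pillzthatwork.com" for further processing.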

[0019] Furthermore, processing logic may reformat the URLs extracted or

may retain such information in the original form. For example, hostnames may
be reduced to lowercase and the leading and trailing white space pruned. In
one embodiment, if a hostname is an IP address, the IP address is retained in the
original form. The hostname can be canonicalized and reduced to the domain
name that is registered at a domain registrar. However, if the hostname is a
top-level domain (TLD) name, then the parts before the second level name may be
pruned. For example, "m6nb2.pillzthatwork.com" is reduced to
"pillzthatwork.com," while "203.12.32.106" and "name.sf.ca.us" remain
unmodified.
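
As a non-limiting sketch of the canonicalization just described, the following Python fragment reproduces the three examples given. The name canonicalize_host and the small REGISTRY_SUFFIXES table are hypothetical; an actual embodiment would presumably consult a complete list of registry suffixes, which is not specified here.

```python
import ipaddress

# Hypothetical, deliberately tiny table of suffixes under which names are
# registered. This is an assumption of the sketch, not a complete list.
REGISTRY_SUFFIXES = {"com", "net", "org", "biz", "info", "sf.ca.us"}


def canonicalize_host(host):
    """Lowercase and trim a hostname, then reduce it to its registered domain."""
    host = host.strip().lower()
    try:
        ipaddress.ip_address(host)
        return host  # IP addresses are kept as they are
    except ValueError:
        pass
    labels = host.split(".")
    # Scan from the longest possible suffix to the shortest.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in REGISTRY_SUFFIXES:
            # Keep exactly one label in front of the matched suffix.
            return host if i == 0 else ".".join(labels[i - 1:])
    return host  # unknown suffix: leave the hostname unchanged
```

Scanning from the longest suffix first is what leaves "name.sf.ca.us" intact while still reducing "m6nb2.pillzthatwork.com" to "pillzthatwork.com".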

[0020] Referring back to Figure 1, processing logic generates one or more 

derivatives of the hostname extracted (processing block 120). These derivatives 
may be referred to as signatures. For example, processing logic may generate one 
signature based on each URL extracted from the electronic communication.
Furthermore, processing logic may generate a unique signature for each unique
URL. Alternatively, processing logic may generate one signature based on
multiple URLs extracted from the electronic communication. Furthermore,
processing logic may generate one or more signatures based on the URLs
extracted and the length of the electronic communication.
[0021] To generate signatures, processing logic may perform various

computations or hashing on the URLs extracted. For example, in one
embodiment, processing logic computes a SHA1 hash over the hostname
extracted and uses the first 48 bits of the hash result as the first part of a
signature. Processing logic may derive the next 16 bits of the signature from the
length of the electronic communication. For example, the length may be
computed using the following formulae:

length = orig_length - (orig_length % 100), where % is the remainder of integer
division;

length = length < 100 ? 100 : length; that is, if length is less than 100, then length
should be set to 100, otherwise, the original value of length should be retained.
In the above example, the resultant length would be a multiple of 100.
[0022] In one embodiment, the first 16 bits of SHA1(length) are

concatenated to the 48 bits generated by the SHA1 of host to form a 64-bit
signature of the electronic communication. The 64-bit signature may be referred
to as a "Whiplash" in some embodiments.
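
The 64-bit signature described above may be sketched, for illustration only, as follows. The function names bucket_length and whiplash are hypothetical, and the byte encodings fed to SHA1 (UTF-8 for the hostname, a decimal ASCII string for the length) are assumptions of this sketch, since the description does not fix them.

```python
import hashlib


def bucket_length(orig_length):
    """Round the message length down to a multiple of 100, with a floor of 100."""
    length = orig_length - (orig_length % 100)
    return 100 if length < 100 else length


def whiplash(hostname, orig_length):
    """Build an 8-byte signature: 48 bits of SHA1(host) + 16 bits of SHA1(length)."""
    # First 6 bytes (48 bits) of the SHA1 hash over the hostname.
    host_part = hashlib.sha1(hostname.encode("utf-8")).digest()[:6]
    # First 2 bytes (16 bits) of the SHA1 hash over the bucketed length.
    # Hashing the decimal string form of the length is an assumption.
    length_text = str(bucket_length(orig_length)).encode("ascii")
    length_part = hashlib.sha1(length_text).digest()[:2]
    return host_part + length_part
```

Because the length is bucketed before hashing, two messages that promote the same host and differ in length only within the same 100-unit bucket would produce the same signature under this sketch.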

[0023] However, one should appreciate that numerous computations may 

be performed to generate the signatures and variations of computation are within 
the scope of the present invention. The examples described above are merely for 
the purpose of illustration. Alternatively, processing logic may use the extracted 
URLs as the signatures. 

[0024] Referring back to Figure 1, processing logic selects one or more of

the signatures generated (processing block 130). In one embodiment, the 
signatures are selected randomly. Processing logic compares the selected 
signatures against a set of predetermined signatures stored in a number of 
databases (processing block 140). The predetermined signatures stored may be 
generated from various known electronic communications of the first
predetermined category reported by users via a collaborative submission 
mechanism. For example, the first predetermined category of electronic 
communication may include spams and the community of users reporting spams
is SpamNet provided by Cloudmark, Inc. in San Francisco, California. 
Signatures are generated based on the URLs extracted from the reported 
electronic communications, such as domain names, hostnames, and IP addresses 
in the reported electronic communications. 

[0025] In one embodiment, hostname canonicalization may be performed

to extract the canonical domain from the hostname such that the extracted domain
name is part of the host that was registered at a domain registrar. After 
performing hostname canonicalization, selection is performed on the hostnames 
and/or domain names extracted to evaluate whether a particular host or domain is 
suitable for acting as a source for a signature to filter electronic communication of 
the first predetermined category, such as spams. The fact that a domain is 
promoted may not imply that the electronic communication containing the
domain name is a spam. The determination of whether the electronic
communication is a spam is derived by the votes from the SpamNet community
in one embodiment. Based on the reports from trusted users on the signatures 
computed on the promoted domains, a domain or host may be determined to be 
providing a landing page for spams. Such determination is also referred to as 
categorization. Some promoted domains may be deemed legitimate by the users 
reporting spams, and hence, these promoted domains are, nevertheless, not used 
for filtering spams. 

[0026] In one embodiment, domain names that contain ".biz" or ".info"

are promoted. Alternatively, signatures representing URLs that contain a
predetermined string of characters or letters, such as "rx", "herb", "pharm", etc.,
may be promoted. Processing logic may also promote domain names containing
certain IP addresses. Alternatively, processing logic may promote domain names
that were registered within a certain period of time, such as the last six months.
Furthermore, processing logic may demote domain names that contain dictionary
words. Furthermore, a user may specify a particular domain name or hostname,
in addition to the domain names selected by processing logic, such that the
user-specified domain name or hostname is used in filtering the incoming electronic
communications.
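
Purely as an illustrative sketch, the promotion and demotion criteria listed above could be combined into a simple score. The description does not say how the criteria are weighted or combined, so the additive scoring, the name promotion_score, the 183-day window, and the small DICTIONARY_WORDS set below are all assumptions.

```python
from datetime import date, timedelta

PROMOTED_ENDINGS = (".biz", ".info")
PROMOTED_STRINGS = ("rx", "herb", "pharm")
DICTIONARY_WORDS = {"apple", "garden"}  # hypothetical stand-in for a word list
RECENT_WINDOW = timedelta(days=183)     # roughly six months; an assumption


def promotion_score(domain, registered_on=None, today=None):
    """Return a score: each promoting criterion adds 1, demotion subtracts 1."""
    name = domain.lower()
    first_label = name.split(".")[0]
    score = 0
    if name.endswith(PROMOTED_ENDINGS):
        score += 1
    if any(s in first_label for s in PROMOTED_STRINGS):
        score += 1
    if registered_on is not None:
        today = today or date.today()
        if today - registered_on <= RECENT_WINDOW:
            score += 1
    if any(word in first_label for word in DICTIONARY_WORDS):
        score -= 1
    return score
```

A domain or hostname specified by a user would bypass such scoring altogether, since the description states that user-specified names are used in filtering regardless.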

[0027] As discussed above, the predetermined signatures derived from the 

promoted domain names are stored in some databases. In one embodiment, the
databases storing the predetermined signatures are referred to as catalog 
databases. Furthermore, the databases may be either local or remote. In one 
embodiment, two types of tables are stored in the catalog databases. The first
type of table (hereinafter referred to as the signature table) stores general
information of the predetermined signatures, and the second type of table
(hereinafter referred to as the meta table) stores meta information of the
predetermined signatures. A signature table may store information on a number
of predetermined signatures and a meta value for each predetermined signature
that links the predetermined signature to an entry in a meta table.
[0028] The meta table may contain meta information about the host from 

which the signature was derived. The meta information may include the first 255 
characters of the hostname that was the source of the corresponding signature, the 
WHOIS registration date of the domain part of the hostname, and a selection 
field. The entry in the selection field indicates whether the signatures derived 
from the host can be used for screening electronic communications. The meta 
information may further include the number of trusted reports and revocations for 
signatures based on the host, as well as the number of different signatures created 
on a particular host. 
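
One possible in-memory shape for the signature table and the meta table is sketched below for illustration. The class and field names (MetaRow, SignatureRow, Catalog, meta_id, and so on) are hypothetical; the description lists what the tables hold but does not prescribe a schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class MetaRow:
    hostname: str                    # source host, first 255 characters only
    whois_registered: Optional[date] # WHOIS registration date of the domain
    selected: bool                   # may signatures on this host be used?
    trusted_reports: int = 0
    revocations: int = 0
    signature_count: int = 0         # distinct signatures created on this host

    def __post_init__(self):
        self.hostname = self.hostname[:255]


@dataclass
class SignatureRow:
    signature: bytes
    meta_id: int                     # the "meta value" linking to a MetaRow


class Catalog:
    def __init__(self):
        self.meta = {}        # meta_id -> MetaRow
        self.signatures = {}  # signature -> SignatureRow

    def add(self, signature, meta_id, meta):
        """Store a signature and link it to (possibly existing) meta information."""
        row = self.meta.setdefault(meta_id, meta)
        if signature not in self.signatures:
            row.signature_count += 1
            self.signatures[signature] = SignatureRow(signature, meta_id)

    def usable(self, signature):
        """True if the signature is known and its host is selected for screening."""
        row = self.signatures.get(signature)
        return row is not None and self.meta[row.meta_id].selected
```

Keeping the selection field on the meta row means that deselecting a host disables every signature derived from it in a single update.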

[0029] Based on the comparison of the selected signatures against the 

predetermined signatures in the databases, processing logic determines whether 
one of the selected signatures matches an entry in the databases (processing block
150). If there is a match, processing logic identifies the electronic 
communication as an electronic communication of the first predetermined 
category (processing block 160). In one embodiment, processing logic blocks the 
identified electronic communication. Alternatively, processing logic may tag the
identified electronic communication or put the identified electronic
communication into a predetermined location. If there is no match, processing
logic may pass the electronic communication (processing block 170).
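
The selection and comparison of processing blocks 130 through 170 may be sketched as follows, again only as an illustration. The names screen and Verdict, the optional sample_size parameter, and the use of a plain Python set to stand in for the databases are assumptions of this sketch.

```python
import random
from enum import Enum


class Verdict(Enum):
    MATCH = "match"  # communication is of the first predetermined category
    PASS = "pass"    # no selected signature matched


def screen(signatures, catalog, sample_size=None, rng=None):
    """Randomly select signatures and compare them against stored signatures."""
    rng = rng or random.Random()
    candidates = sorted(set(signatures))
    # Block 130: optionally select a random subset of the generated signatures.
    if sample_size is not None and sample_size < len(candidates):
        candidates = rng.sample(candidates, sample_size)
    # Blocks 140-150: look each selected signature up in the stored set.
    for signature in candidates:
        if signature in catalog:
            return Verdict.MATCH  # block 160
    return Verdict.PASS           # block 170
```

What happens on a match (blocking, tagging, or moving the communication to a predetermined location) is left to the caller in this sketch.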
[0030] One advantage of screening electronic communications based on 

URLs is to make it harder for spammers to defeat the screening process. Because
registering many domain names, hostnames, or IP addresses as landing web pages
is far more expensive than randomizing the contents of electronic
communications, the spammers are less likely to defeat the screening processes
based on URLs.

[0031] Figure 2 illustrates one embodiment of a networked system to 

screen electronic communications for electronic communications of a first 
predetermined category. The networked system 200 includes a server 210, a
network 220, catalog databases 230, a nomination database 235, a user personal 
computer (PC) 250, and a user email server 260. The server 210, the catalog 
databases 230, the nomination database 235, the user personal computer (PC) 
250, and the user email server 260 are coupled to each other via the network 220, 
which may include a local area network (LAN), a wide area network (WAN), or 
other types of networks. 

[0032] Note that any or all of the components and the associated hardware 

illustrated in Figure 2 may be used in various embodiments of the networked 
system 200. In one embodiment, the networked system 200 may be a distributed 
system. Some or all of the components in the networked system 200 (e.g., the 
catalog databases 230) may be local or remote. However, it should be appreciated
that other configurations of the networked system may include one or more
additional devices not shown in Figure 2. 

[0033] Users of the networked system may have their PCs, such as the PC 

250, coupled to the network 220 in order to access the catalog databases 230. 
Alternatively, enterprise users may have their electronic mail servers 260 or
gateway servers coupled to the network 220 in order to access the databases 230. 
[0034] Users may send reports 240 on electronic communications 

identified to be of the first predetermined category to the nomination database
235 via the network 220. For example, some of these reports 240 may be sent 
from the user PC 250 or the user email server 260. An example of such a 
community of users reporting spams is SpamNet provided by Cloudmark, Inc. in 
San Francisco, California. The server 210 generates signatures of the reported 
electronic communications based on the URLs extracted from the reported
electronic communications, such as domain names, hostnames, and IP addresses.
The signatures are stored in the catalog databases 230. 

[0035] When a user receives electronic communication, the user PC 250 

or the user email server 260 may extract URLs from the electronic 
communication to generate a number of signatures. One or more of the
signatures generated may be selected and compared against the signatures stored
in the catalog databases 230. If there is a matching signature, then the electronic
communication is identified to be of the first predetermined category. In one 
embodiment, the electronic communication may be blocked automatically after 
being identified to be of the first predetermined category. Alternatively, the 
identified electronic communication may be tagged. In one embodiment, the 
identified electronic communication is removed from the inbox of the user and 
put into a predetermined location so that the blocked electronic communication is
not lost. A user may review the blocked electronic communications and decide
not to block a particular electronic communication, i.e., to unblock the electronic
communication.

[0036] Some portions of the preceding detailed description have been 

presented in terms of algorithms and symbolic representations of operations on 
data bits within a computer memory. These algorithmic descriptions and 
representations are the tools used by those skilled in the data processing arts to
most effectively convey the substance of their work to others skilled in the art.
An algorithm is here, and generally, conceived to be a self-consistent sequence of 
operations leading to a desired result. The operations are those requiring physical 
manipulations of physical quantities. Usually, though not necessarily, these 
quantities take the form of electrical or magnetic signals capable of being stored, 
transferred, combined, compared, and otherwise manipulated. It has proven 
convenient at times, principally for reasons of common usage, to refer to these 
signals as bits, values, elements, symbols, characters, terms, numbers, or the like. 
[0037] It should be kept in mind, however, that all of these and similar 

terms are to be associated with the appropriate physical quantities and are merely 
convenient labels applied to these quantities. Unless specifically stated otherwise 
as apparent from the following discussion, it is appreciated that throughout the 
description, discussions utilizing terms such as "processing" or "computing" or 
"calculating" or "determining" or "displaying" or the like, refer to the action and 
processes of a computer system, or similar electronic computing device, that 
manipulates and transforms data represented as physical (electronic) quantities 
within the computer system's registers and memories into other data similarly 
represented as physical quantities within the computer system memories or 
registers or other such information storage, transmission or display devices. 
[0038] The present invention also relates to an apparatus for performing 

the operations described herein. This apparatus may be specially constructed for 
the required purposes, or it may comprise a general-purpose computer selectively 
activated or reconfigured by a computer program stored in the computer. Such a 
computer program may be stored in a computer readable storage medium, such 
as, but not limited to, any type of disk including floppy disks, optical disks,
CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), random 
access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or 
any type of media suitable for storing electronic instructions, and each coupled to 
a computer system bus. 
[0039] The processes and displays presented herein are not inherently

related to any particular computer or other apparatus. Various general-purpose 
systems may be used with programs in accordance with the teachings herein, or it 
may prove convenient to construct a more specialized apparatus to perform the 
operations described. The required structure for a variety of these systems will 
appear from the description below. In addition, the present invention is not 
described with reference to any particular programming language. It will be 
appreciated that a variety of programming languages may be used to implement 
the teachings of the invention as described herein. 

[0040] A machine-accessible medium includes any mechanism for storing 

or transmitting information in a form readable by a machine (e.g., a computer).
For example, a machine-readable medium includes read only memory ("ROM"); 
random access memory ("RAM"); magnetic disk storage media; optical storage 
media; flash memory devices; electrical, optical, acoustical or other form of 
propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.); etc.
[0041] The foregoing discussion merely describes some exemplary 

embodiments of the present invention. One skilled in the art will readily 
recognize from such discussion, the accompanying drawings and the claims that 
various modifications can be made without departing from the spirit and scope of 
the invention. 
CLAIMS

What is claimed is: 

1. A method comprising:

extracting URLs from electronic communication; and
analyzing the URLs extracted to determine whether the electronic
communication is of a first predetermined category.

2. The method of claim 1, wherein extracting the URLs comprises extracting 
at least one of a hostname, a domain name, a subsection of a domain relative link, 
and an Internet Protocol (IP) address from the electronic communication.

3. The method of claim 1, further comprising performing a predetermined 
operation on the electronic communication if the electronic communication is 
determined to be of the first predetermined category. 

4. The method of claim 1, wherein analyzing the URLs comprises: 
generating one or more signatures based on the URLs extracted; 
selecting one or more of the one or more signatures generated; and 
comparing the selected signatures against a plurality of predetermined 

signatures generated from a plurality of known electronic communications of the
first predetermined category. 

5. The method of claim 4, wherein generating the one or more signatures 
further comprises using a length of the electronic communication to generate the
one or more signatures. 

6. The method of claim 4, wherein generating the one or more signatures
further comprises using the extracted URLs as the one or more signatures.

7. The method of claim 4, wherein generating the one or more signatures
further comprises generating the one or more signatures based on at least one of a 
protocol, a hostname, a domain name, a subsection of a domain relative link, and 
an Internet Protocol (IP) address from the electronic communication. 

8. The method of claim 4, further comprising classifying the electronic 
communication to be of the first predetermined category if one of the selected 
signatures matches one of the plurality of predetermined signatures. 

9. The method of claim 4, wherein the plurality of predetermined signatures
is derived from a plurality of electronic documents reported via a collaborative 
submission mechanism. 

10. A machine-accessible medium that provides instructions that, if executed 
by a processor, will cause the processor to perform operations comprising: 

generating one or more signatures of electronic communication based on
URLs in the electronic communication; and 

determining whether the electronic communication is of a first 
predetermined category using the one or more signatures generated. 

11. The machine-accessible medium of claim 10, wherein determining
whether the electronic communication is of the first predetermined category
comprises:

selecting one or more of the one or more signatures generated based on a
plurality of predetermined criteria;

comparing the selected signatures against a plurality of predetermined
signatures; and

classifying the electronic communication to be of the first predetermined
category if one of the selected signatures matches one of the plurality of 
predetermined signatures. 

12. The machine-accessible medium of claim 11, wherein selecting one or 
more of the one or more signatures generated comprises selecting a signature if 
the signature represents a domain that was registered within a predetermined 
period of time. 

13. The machine-accessible medium of claim 11, wherein selecting one or
more of the one or more signatures generated comprises selecting signatures 
representing one or more of a protocol, a hostname, a domain name, and a 
subsection of a domain relative link having a predetermined string of letters. 

14. The machine-accessible medium of claim 10, wherein the operations 
further comprise extracting the URLs from the electronic communication.

15. A system comprising: 

a plurality of databases to store a plurality of predetermined signatures of
a plurality of known electronic communications of a first predetermined category; 
and 

a server, coupled to the plurality of databases, including: 

a memory device to store a plurality of instructions; and
a processor, coupled to the memory device, to retrieve the plurality 
of instructions from the memory device and to perform operations in response to 
the plurality of instructions, the operations comprising: 

extracting URLs from electronic communication to
generate one or more signatures; and
comparing one or more of the one or more signatures
generated against the plurality of predetermined signatures stored in the plurality
of databases to determine whether the electronic communication is of the first 
predetermined category. 

16. The system of claim 15, wherein the URLs comprise at least one of a
hostname, a domain name, a subsection of a domain relative link, and an Internet
Protocol (IP) address.

17. The system of claim 15, wherein the operations further comprise selecting 
the one or more of the plurality of signatures based on a plurality of 
predetermined criteria. 

18. The system of claim 15, wherein the operations further comprise 
performing a predetermined operation on the electronic communication if the 
electronic communication is determined to be of the first predetermined category.

19. The system of claim 15, further comprising a database, coupled to the
server, to store a plurality of reports from which the plurality of predetermined 
signatures are generated. 

20. The system of claim 15, wherein the plurality of databases are in a remote
location from the server. 